Generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) was used to estimate 
the percentage of variance explained by three sources of variability in speechreading sentences: 
the subject, the talker, and the sentence materials. Videodisc recordings of the 100 CID 
Everyday Sentences (Davis & Silverman, 1970), spoken by a male and a female talker, were 
presented to 104 subjects with normal hearing. For performance on individual sentences (total 
number of words correct), the most important systematic sources of variability were the sentence 
(26.3%), the speechreader (10.5%), the talker (4.9%), and the interaction of talker and sentence 
(5.1%). Residual error accounted for 51.2% of the variance. Generalizability functions are 
presented, as a function of test length, for five models of test administration and interpretation. 
For 10-, 50-, and 100-item lists, generalizability is predicted to be .70, .92, and .96, respectively, 
for a single talker. Psychometric characteristics of these recordings of the CID sentences are 
also presented. 

KEY WORDS: speechreading (lipreading), individual differences, assessment, visual 
speech perception, generalizability analysis 



Individual differences in speechreading have long been of interest to clinicians 
engaged in rehabilitation of adults and children with hearing impairment. Assessment 
of speechreading performance reveals wide-ranging individual differences, even after 
training (Jeffers & Barley, 1971; O'Neill & Oyer, 1981). Knowledge about the extent 
and nature of individual differences has the potential to provide clues to the basic 
cognitive and linguistic processes underlying speechreading. Recent work by cogni- 
tive psychologists and speech scientists (Dodd & Campbell, 1987; Massaro, 1987) 
suggests that visual speech processing is an important part of speech perception, 
even among those with normal hearing. Thus individual differences in speechreading 
can contribute to our understanding of speech perception in general. 

Clinical observation of large individual differences and the acknowledged impor- 
tance of assessing them notwithstanding, widely accepted, standardized tests of 
speechreading are still lacking. The research reported here has as one of its 
long-term goals the development and evaluation of a battery of tests for assessing 
speechreading in persons with normal and impaired hearing. An important condition 
for the success of this work is identification and quantification of sources of variability 
in speechreading performance. The present study examines the importance of two 
stimulus factors that can affect speechreading performance (the talker and the test 
materials) in relation to the magnitude of individual differences among subjects with 
normal hearing. 



Talker Differences 

Empirical research on talker effects has focused on nonsense syllables and 
isolated words, which have been used primarily to obtain 
descriptive data on the visibility and confusability of pho- 
nemes. In the process, it has been demonstrated that visual 
intelligibility varies from one talker to the next and that error 
patterns reveal different categories of visual similarity among 
phonemes for different talkers (Kricos & Lesner, 1982, 1985; 
Montgomery & Jackson, 1983; Montgomery, Walden, & 
Prosek, 1987). These findings have implications for 
speechreading of sentences to the extent that phonemic 
perception per se plays a role in performance at the sentence 
level. There is little research on talker differences for sen- 
tence materials, but talker features such as facial cues, 
gestures, speaking rate, and rhythm have been shown to 
affect speechreading performance on other materials 
(Lesner, 1988). If talkers vary on these parameters, it is likely 
that their sentence intelligibility will vary also. In previous 
work on tactile supplements to speechreading we have noted 
a consistent difference in the intelligibility of two talkers 
(Bernstein, Demorest, Coulter, & O'Connell, 1991; Bernstein, 
Eberhardt, & Demorest, 1989; Eberhardt, Bernstein, Demor- 
est, & Goldstein, 1990), with the male talker being easier to 
speechread than the female talker, especially for sentences. 

The existence of talker differences has implications for the 
construction of speechreading tests. Test construction is 
greatly simplified if talker differences are the same in direc- 
tion and magnitude across observers. When there is a 
constant talker effect that is the same for all subjects, there is 
a main effect of talker, but there is no Subject x Talker 
interaction. Such differences in talker intelligibility are not 
problematic because separate norms can be generated for 
each talker and the correlation between scores obtained with 
different talkers can, at least in principle, be perfect. How- 
ever, if talker differences are not consistent from one 
speechreader to another, there is a Subject x Talker inter- 
action effect, which reduces the correlation (and hence 
equivalence) of scores obtained using different talkers. The 
obvious psychometric implication is that the more consistent 
the talker differences across speechreaders, the higher the 
intertalker correlations and, hence, the reliability of the test. 



Test Materials 

A comprehensive assessment of speechreading involves 
different types of test materials: nonsense syllables, isolated 
words, isolated sentences, topic-related sentences, con- 
nected discourse, and so on. Whether these different types of 
materials tap independent or highly correlated components 
of speechreading is an important topic that has not been 
resolved. Within each type of material, however, the specific 
items chosen to represent that content domain are also 
important. Just as talkers differ in their intelligibility, individual 
stimulus items differ in difficulty. A systematic approach to 
test construction must therefore begin with an evaluation of 
the magnitude of these stimulus effects and a determination 
of whether they are consistent across speechreaders. If 
items vary widely in difficulty, then item calibration is neces- 
sary for the development of test forms of equal difficulty. A 
further complication arises if the relative difficulty of items 
varies from one observer to the next. Such Subject x Item 
interaction reduces the correlation between different test 
forms and thus lowers alternate-form test reliability. 

Psychometric Model 

Generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 
1972) is an extension of classical true-score theory 
(Allen & Yen, 1979), in which the variance of test scores is 
partitioned so that components of variance arising from 
several sources, and their interactions, can be identified. 
Generalizability analysis begins with an experimental design 
for data collection and the corresponding statistical model for 
an analysis of variance. Mean squares from the analysis of 
variance are used both to test the statistical significance of 
main effects and their interactions and to generate variance 
estimates that represent the magnitude of these effects in the 
population. The variance estimates are extremely useful in 
test design because they indicate which sources of error 
variance are most important to control and they provide a 
basis for estimating the number of test items needed to 
achieve a desired level of generalizability. The concept of 
generalizability is closely related to the traditional concept of 
test reliability. 

The primary purpose of the present study was to estimate 
the percentages of variance in speechreading sentences 
accounted for by the talker and the specific sentences used, 
and to compare them to the percentage attributable to 
individual differences among speechreaders. Because tests 
are designed to assess individual differences among sub- 
jects, the variance attributable to subjects should be large in 
comparison to other components of variance. Talkers and 
test materials, on the other hand, are considered sources of 
error variance, as are interactions among subject, talker, and 
sentence. 

The sentences used for the experiment were the CID 
Everyday Sentences (Davis & Silverman, 1970). These 
sentences are widely used as speechreading materials, 
especially in the evaluation of sensory aids. Because record- 
ings of the CID Everyday Sentences can potentially be used 
as a standardized speechreading test, a second purpose of 
the experiment was to provide preliminary psychometric data 
for individual sentences and sentence lists. 

Method 

Subjects 

Subjects were 104 young adults who were recruited from 
the graduate and undergraduate student population at the 
Johns Hopkins University and paid for their participation. 
Subjects ranged in age from 16 to 37 years (M = 20.0; SD = 
3.5). All subjects reported having normal hearing and normal 
or corrected vision, and all were native speakers of English. 
Fifty-two subjects were male. 

Materials 

Stimuli were the 100 CID Everyday Sentences (Davis & 
Silverman, 1970) recorded on videodisc by a male and a 
TABLE 1. Sequence of testing for each talker assignment and talker order.

                                         Talker
                                 Male              Female
Talker       Talker
assignment   order      n        Set 1    Set 2    Set 1    Set 2

1            1          26       1st                        2nd
1            2          26       2nd                        1st
2            1          26                1st      2nd
2            2          26                2nd      1st

Note. Set 1 = Lists 1-5; Set 2 = Lists 6-10. 



female talker (Bernstein & Eberhardt, 1986). The female 
talker was an actress and the male talker was a singer. Both 
spoke General American English. They were instructed to 
speak each sentence as if it had been extracted from natural 
conversation. This resulted in natural intonation patterns and 
facial expressions. Each sentence was begun from a re- 
laxed, closed-mouth position. 

Videorecordings were made using a professional-quality 
video recorder (Sony BVU 110) and camera (Hitachi Z31). 
Talkers were seated in front of a dark background, and 
lighting was provided by two face-level direct 600-W flood- 
lights, at approximately 35° to either side of the midline. Two 
600-W fill lights were positioned above the talkers. The 
talker's mouth was slightly below the center of the video 
screen, and the face filled most of the screen area. A 
teleprompter allowed the talkers to look directly into the 
camera at all times. 



Experimental Conditions 

Each subject viewed 100 sentences, 50 spoken by the 
female talker and 50 spoken by the male talker. The first 50 
sentences (Lists 1-5 from Davis & Silverman) were designated 
Set 1, and the second 50 sentences (Lists 6-10) were 
designated Set 2. 

Assignment of talkers to the two sentence sets was 
counterbalanced: Half the subjects saw the male talker for 
Set 1 (Talker Assignment 1), and half saw the female talker 
for Set 1 (Talker Assignment 2). The order of talkers was also 
counterbalanced: Half the subjects saw the male talker first 
(Talker Order 1), and half saw the female talker first (Talker 
Order 2). Equal numbers of male and female subjects were 
randomly assigned to each of the four groups generated by 
this 2 x 2 design (Talker Assignment x Talker Order). The 
resulting sequence of testing for the four groups of subjects 
is shown in Table 1. 



Procedures 

The subject was seated approximately 8 feet from a 
16-inch color monitor (NEC RGB) in a darkened IAC sound-attenuating 
room. The videodisc player (Pioneer LDP 1000) 
was controlled by a DEC PDP 11/73 computer located 
outside the room. A video terminal was used to instruct the 
subject during testing and to record the subject's typewritten 
response. 
Stimulus presentation on the first trial was initiated by the 
subject pressing a key on the terminal. After a brief pause, 
the initial frame of a sentence was presented for 2 sec, then 
the sentence was played. The videodisc player froze on the 
final frame of the sentence while the subject recorded a 
response. Subjects were instructed to type whatever they 
thought the talker had said, including sentence and word 
fragments, and they were encouraged to guess. After the 
response, there was a brief pause before the first frame of 
the next sentence appeared. Testing was conducted in two 
50-sentence blocks, one for each talker, and within each 
block the order of the five lists was independently random- 
ized for each subject. Testing took 20-30 min to complete. 

The subject's typed response was stored by the computer 
for further processing. Responses were edited for spelling 
errors, but only when there was no ambiguity regarding the 
intended response. Numerals were converted to spelled form 
and punctuation of contractions was checked for consist- 
ency. An automatic scoring program then counted the num- 
ber of words correct in each sentence. All words in the 
sentence were included in scoring, and contractions were 
counted as two words. No credit was given for partially 
correct words or for homophones, and word order had to be 
maintained for a word to be considered correct. 
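
The scoring rules above can be illustrated with a short sketch. This is not the authors' scoring program, which is not described in detail; it is one plausible reading of the rules. The function names `tokenize` and `words_correct` are hypothetical, splitting a contraction at its apostrophe is an assumed way of counting it as two words, and using a longest-common-subsequence match is an assumed way of enforcing that word order be maintained.

```python
import re


def tokenize(sentence):
    """Split a sentence into lowercase word tokens.

    Assumption: a contraction such as "I'm" is split at the apostrophe
    into two tokens ("i", "'m") so that it counts as two words.
    """
    tokens = []
    for word in re.findall(r"[a-z']+", sentence.lower()):
        word = word.strip("'")
        if not word:
            continue
        if "'" in word:
            head, _, tail = word.partition("'")
            tokens.extend([head, "'" + tail])
        else:
            tokens.append(word)
    return tokens


def words_correct(target, response):
    """Count response words that match the target exactly and in order.

    Assumption: order is enforced with a longest common subsequence,
    so no credit is given for partial words or out-of-order words.
    """
    t, r = tokenize(target), tokenize(response)
    prev = [0] * (len(r) + 1)
    for target_word in t:
        cur = [0]
        for j, response_word in enumerate(r, 1):
            if target_word == response_word:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

Under these assumptions, a fully correct response to "I'm sorry" scores three words, and a response that reverses the word order of "Here we go" scores one.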



Design and Analysis 

The experimental design consisted of three between- 
subjects variables and three within-subject variables. The 
three between-subjects variables were talker assignment, 
talker order, and subject gender, all of which were consid- 
ered fixed. The design was balanced, with 13 subjects per 
cell. The within-subject variables were nested hierarchically: 
set, list within set, and sentence within list, and all were 
considered random. That is, the sentences in each list were 
considered a random sample from an infinite population of 
such sentences, and the two sets of five lists were consid- 
ered a random sampling of such lists. However, given this 
random-effects model, it was necessary to employ quasi-F 
ratios (Winer, 1971) for testing all effects except those 
involving subjects and sentences. Because each subject saw 
only half the sentences for each talker, the counterbalanced 
experimental design contains confounded effects. First, the 
interaction of Talker Assignment x Talker Order is equivalent 
to the main effect of set order: Talker Assignment 1/Talker 
Order 1 and Talker Assignment 2/Talker Order 2 both saw 
Set 1 sentences first; Talker Assignment 1/Talker Order 2 
and Talker Assignment 2/Talker Order 1 saw Set 2 sen- 
tences first. Second, the interaction of Talker Assignment x 
Set is equivalent to the main effect of talker: Talker Assign- 
ment 1/Set 1 and Talker Assignment 2/Set 2 represent the 
male talker, whereas Talker Assignment 1/Set 2 and Talker 
Assignment 2/Set 1 represent the female talker. 
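
The two confounded effects can be checked by enumerating the design. The sketch below is illustrative only; the function names `talker_for` and `first_set` are hypothetical, and the coding of assignments and orders follows the definitions given under Experimental Conditions.

```python
from itertools import product


def talker_for(assignment, sentence_set):
    """Return the talker who speaks a sentence set under a talker assignment.

    Talker Assignment 1: male talker speaks Set 1.
    Talker Assignment 2: female talker speaks Set 1.
    """
    male_set = 1 if assignment == 1 else 2
    return "male" if sentence_set == male_set else "female"


def first_set(assignment, order):
    """Return the sentence set viewed first.

    Talker Order 1: male talker first. Talker Order 2: female talker first.
    """
    first_talker = "male" if order == 1 else "female"
    return next(s for s in (1, 2) if talker_for(assignment, s) == first_talker)


if __name__ == "__main__":
    # Talker Assignment x Talker Order determines which set is seen first.
    for assignment, order in product((1, 2), repeat=2):
        print(f"TA {assignment} / TO {order}: Set {first_set(assignment, order)} first")
    # Talker Assignment x Set determines the talker.
    for assignment, sentence_set in product((1, 2), repeat=2):
        print(f"TA {assignment} / Set {sentence_set}: {talker_for(assignment, sentence_set)} talker")
```

Each combination of talker assignment and talker order maps to exactly one set order, and each combination of talker assignment and set maps to exactly one talker, which is why these interactions cannot be separated from the corresponding main effects.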

The SPSS-X (1986) MANOVA procedure was used to 
obtain the mean square for each effect in the model. The 
mean squares were then used to form quasi-F ratios and to 
estimate the variance component for each effect. Variables 
with very few significant (p < .05) effects were dropped from 
the model if they accounted for < 1% of the variance in the 
TABLE 2. Descriptive statistics for performance on the Johns Hopkins CID sentences (total 
words correct) as a function of list and talker. 

                                      Talker
                  Female            Male              Both
        No.
List    words     M       SD       M        SD       M       SD

1       69        13.8    7.3      20.4     8.9      17.1    8.8
2       71        16.4    9.8      16.1     9.4      16.2    9.6
3       81        13.2    10.4     20.5     10.8     16.9    11.2
4       80        14.2    6.7      19.1     9.3      16.6    8.5
5       71        13.2    8.8      24.6     9.0      18.9    10.6
1-5     372       70.7    37.2     100.7    41.9     85.7    42.2

6       75        9.2     5.9      14.3     10.3     11.8    8.7
7       76        11.6    7.9      16.4     13.0     14.0    11.0
8       72        8.6     6.0      14.5     10.8     11.5    9.2
9       74        5.2     3.5      12.8     10.2     9.0     8.5
10      80        16.0    7.6      30.8     10.0     23.4    11.6
6-10    377       50.6    23.8     88.9     48.3     69.7    42.5

Note. Statistics are based on two samples, n1 = n2 = 52. Subjects with Talker Assignment 1 viewed 
the female talker for Lists 1-5 and the male talker for Lists 6-10; for the remaining subjects the 
assignment was reversed. 



dependent variable. A second analysis was then performed 
using the reduced model, and final estimation of variance 
components was based on the reduced model. Appendix A 
gives the detailed procedures for estimating variance com- 
ponents for the reduced model. 

Results and Discussion 

Descriptive Statistics for CID Sentence Lists and 
Sets 

A statistical summary of the data for individual sentences, 
for each talker, is given in Appendix B. Table 2 presents this 
information pooled across the 10 sentences within each list 
and across lists for each set. The data have also been pooled 
across the two talker orders. Overall, the subjects obtained 
an average of 1.55 words correct per sentence, which 
represents 20.8% of the 749 stimulus words. For individual 
subjects the performance range was from .01 to 4.02 words 
correct per sentence (.3% to 53.7%), and for individual 
sentences it was from .21 to 5.67 words correct (Sentences 
54 and 92, respectively). If the number of words correct is 
expressed as a percentage of the number of words in the 
sentence, the range was from 2.6% (Sentence 54) to 91.3% 
(Sentence 40). 

Specification of a Reduced Statistical Model 

Analysis of the full experimental design, which is shown in 
Table 3, revealed that the largest experimental effects were 
for sentence, subject, Talker Assignment x Set (i.e., talker), 
and Talker Assignment x Sentence. Talker assignment, 
subject, set, and sentence were therefore retained in the 
reduced statistical model. 

Effects of subject gender were not significant except for an 
uninterpretable interaction of Gender x Sentence, which 
accounted for only .1% of the variance. Gender was there- 



fore dropped from the experimental design, and the data for 
male and female subjects were combined. 

The main effect of talker order was not significant, but 
talker order did interact with other variables. These interac- 
tions accounted for a very small percentage of variance (.8% 
altogether), so talker order was dropped from the design. 

Sentence list was not significant, and of the seven inter- 
actions involving list, only one, Talker Order x List, was 
significant [F(8, 293) = 2.10, p < .05]. This effect accounted 
for only .1% of the variance, and hence list was also dropped 
from the model. It should be noted that the absence of a 
sentence list effect does not imply that the list means were all 
equal, but rather that there were no unique effects associated 
with lists above and beyond those due to the sentences 
nested within them. 



Variance Components for the Reduced Model 

The reduced statistical model contained one between- 
subjects variable, talker assignment, and two within-subjects 
variables (set and sentence within set), both considered 
random. Table 4 shows the mean squares and estimated 
variance components, $\hat{\sigma}^2$. The effects estimated in Table 4 
are presented below in decreasing order of magnitude. 

The largest estimated variance component, 51.2%, was for 
the Subject x Sentence interaction. In a simpler experimen- 
tal design with only two independent variables, subject and 
sentence, this interaction would represent measurement 
error variance. The advantage of generalizability analysis is 
that it permits other potential sources of measurement error 
to be estimated as well (e.g., talker, Talker x Sentence, etc.). 
However, the variance component for the Subject x Sen- 
tence interaction can be greatly reduced in importance by 
lengthening a test. 

The particular sentence viewed was the largest systematic 
source of variance in the experiment. Its effect was significant 
[F(98, 9996) = 54.38, p < .01], and it accounted for 26.3% of the 

TABLE 3. Mean squares and variance components for the full statistical model (total words 
correct). 

                                               Estimated variance
                                 Mean
Source                           square        $\hat{\sigma}^2$     %

Between subjects
  Talker assignment (TA)         17.94         0.000                0.0 b
  Talker order (TO)              110.09        0.012                0.1 b
  Subject gender (G)             4.16          0.000                0.0 b
  TA x TO                        240.65        0.053                0.3 b
  TA x G                         100.04        0.017                0.1 b
  TO x G                         0.65          0.000                0.0 b
  TA x TO x G                    51.24         0.003                0.0 b
  Subjects within cells (Ss)     52.76         0.473                10.1
Within subjects
  Set                            265.60        0.018                0.4
  List within set (L)            166.94        0.039                0.8
  Sentence within set            126.09        1.190                25.5
  TA x Set a                     1,214.51      0.448                4.8 b
  TA x L                         45.69         0.038                0.4 b
  TA x Sentence                  25.60         0.448                4.8 b
  TO x Set                       0.19          0.000                0.0 b
  TO x L                         7.43          0.008                0.1 b
  TO x Sentence                  3.26          0.018                0.2 b
  G x Set                        0.28          0.000                0.0 b
  G x L                          1.58          0.000                0.0 b
  G x Sentence                   3.00          0.013                0.1 b
  TA x TO x Set                  56.12         0.038                0.2 b
  TA x TO x L                    4.14          0.000                0.0 b
  TA x TO x Sentence             4.01          0.065                0.3 b
  TA x G x Set                   9.72          0.003                0.0 b
  TA x G x L                     2.79          0.000                0.0 b
  TA x G x Sentence              2.63          0.012                0.1 b
  TO x G x Set                   0.10          0.000                0.0 b
  TO x G x L                     2.58          0.001                0.0 b
  TO x G x Sentence              2.05          0.000                0.0 b
  TA x TO x G x Set              0.01          0.000                0.0 b
  TA x TO x G x L                2.02          0.001                0.0 b
  TA x TO x G x Sentence         1.68          0.000                0.0 b
  Ss x Set                       5.45          0.057                1.2
  Ss x L                         2.60          0.027                0.6
  Ss x Sentence                  2.33          2.330                49.8

a TA x Set is equivalent to the main effect of talker. b This component involves a fixed effect, and its 
variance component is adjusted in estimating the percentage of total variance it explains (see Appendix 
A). 



variance. It should be noted that this effect is averaged across 
the two talkers, and hence it pertains to the sentences as 
linguistic stimuli rather than to particular tokens. The sentence 
effect is larger than the talker effect and larger than the effect due 
to subjects. Although this result is certainly not a surprising one, 
estimation of its magnitude permits a theoretically based ap- 
proach to test construction for speechreading. 

As was noted earlier, because the goal of testing is 
assessment of individual differences, it is desirable for the 
variance attributable to subjects to be large in comparison to 
other components of variance. In this analysis, the main 
effect of subjects accounted for 10.5% of the variance. A 
psychometric interpretation of this value is that if a subject 
were given a randomly selected sentence from one of the two 
talkers and we used the number of words correct on that 
sentence as an estimate of the subject's expected score 
across the two talkers and all sentences, the generalizability 
of that score would be only .105. That is, compared to the 



other sources of variability in speechreading single sen- 
tences, differences among the subjects accounted for only 
10.5% of the variance. As will be shown below, the effects of 
these other sources can be controlled in test construction by 
increasing the length of the test and the number of talkers 
represented. 

The two interactions of sentence materials with talker 
assignment were the next largest effects in the experiment. 
The interaction of Talker Assignment x Set is formally 
identical to the main effect of talker. For Talker 1, the woman, 
speechreading performance averaged 1.21 words correct 
per sentence, whereas for Talker 2, the man, the average 
was 1.90. This difference was significant [F(1, 119) = 39.64, 
p < .01] and accounted for 4.9% of the variance. Of greater 
interest than the main effect of talker, however, is the 
interaction of Talker Assignment x Sentence. The interaction 
was significant [F(98, 9996) = 11.45, p < .01], and the 
estimated variance explained was 5.1%. This effect may be 



TABLE 4. Mean squares and variance components for the 
reduced statistical model (total words correct). 

                                            Estimated variance
                              Mean
Source                        square        $\hat{\sigma}^2$     %

Between subjects
  Talker assignment (TA)      17.94         0.000                0.0 b
  Subjects within TA (Ss)     54.63         0.489                10.5
Within subjects
  Set                         265.60        0.026                0.5
  Sentence within Set         129.42        1.222                26.3
  TA x Set a                  1,214.51      0.456                4.9 b
  TA x Sentence               27.24         0.478                5.1 b
  Ss x Set                    5.78          0.068                1.5
  Ss x Sentence               2.38          2.380                51.2

a TA x Set is equivalent to the main effect of talker. b This component 
involves a fixed effect, and its variance component is divided by 2 in 
estimating the percentage of total variance it explains (see Appendix 
A). 



interpreted as an interaction between talker and sentence, 
and it indicates that the talker differences were not constant 
from one sentence to the next. For example, on Sentence 1, 
Walking's my favorite exercise, the difference between the 
talkers was reversed (M = 1.08 for the woman and M = .67 
for the man). Together, the two talker effects accounted for 
10.0% of the variance in the speechreading scores. 
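
The percentages in Table 4 can be approximately reproduced from the estimated variance components. The sketch below is an illustration, not the Appendix A procedure itself: it assumes that every component carrying footnote b (those involving the fixed talker-assignment effect) is divided by 2 before the total is formed, as the table note states. The names `components`, `FIXED`, and `percent_of_variance` are hypothetical.

```python
# Estimated variance components copied from Table 4 (total words correct).
components = {
    "TA": 0.000,
    "Ss": 0.489,
    "Set": 0.026,
    "Sentence": 1.222,
    "TA x Set": 0.456,
    "TA x Sentence": 0.478,
    "Ss x Set": 0.068,
    "Ss x Sentence": 2.380,
}

# Assumption: components involving the fixed effect (footnote b) are halved.
FIXED = {"TA", "TA x Set", "TA x Sentence"}


def percent_of_variance(components, fixed):
    """Return each component as a percentage of the adjusted total variance."""
    adjusted = {
        name: (value / 2 if name in fixed else value)
        for name, value in components.items()
    }
    total = sum(adjusted.values())
    return {name: 100 * value / total for name, value in adjusted.items()}


if __name__ == "__main__":
    for name, pct in percent_of_variance(components, FIXED).items():
        print(f"{name:15s} {pct:5.1f}%")
```

Because the tabled components are rounded to three decimals, the recomputed percentages agree with Table 4 only to within rounding.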

There was no effect of talker assignment. Overall perfor- 
mance was not affected by the assignment of talkers to 
sentence sets. Because the F ratio was < 1, the estimated 
variance component was negative and was set to zero. 
There was also no effect of sentence set [F(1, 103) = 2.00, 
p > .05]. The mean number of words correct per sentence 
was 1.71 for Set 1 and 1.40 for Set 2. The estimated variance 
component for this source of variance was therefore quite 
small, accounting for only .5% of the variance. It is encour- 
aging to note that even though the interaction of Subject x 
Set was significant [F (102, 9996) = 2.43, p < .01], it 
accounted for a rather small percentage of variance (1.5%). 
This implies only minor differences among subjects in the 
relative difficulty of the two sentence sets. 

The significant and substantial effects of sentence and 
Subject x Sentence indicate that the number of words 
correct varied from one sentence to the next and, for a given 
sentence, from one subject to the next. One obvious possible 
influence on the total number of words correct in a sentence 
is sentence length. The longer the sentence, the greater the 
number of possible words correct. It is possible, therefore, 
that the apparent differences among sentences are an arti- 
fact of defining the dependent variable as the total number of 
words correct, rather than as the percentage of words 
correct. To check this possibility, all analyses were repeated 
using the percentage of words correct in each sentence as 
the performance measure. Table 5 gives the results of the 
variance component analysis. 

The results were essentially the same when the dependent 
variable was the percentage of words correct rather than the 
number of words correct. With this method of controlling for 
sentence length, the estimated variance components are 
virtually identical to those presented in Table 4. This suggests 

TABLE 5. Mean squares and variance components for the 
reduced statistical model (% words correct). 

                                            Estimated variance
                              Mean
Source                        square        $\hat{\sigma}^2$     %

Between subjects
  Talker assignment (TA)      1,811.53      0.00                 0.0 b
  Subjects within TA (Ss)     11,149.56     100.45               9.4
Within subjects
  Set                         82,016.63     9.17                 0.9
  Sentence within Set         33,730.25     319.28               30.0
  TA x Set a                  245,689.53    91.96                4.3 b
  TA x Sentence               6,011.53      105.50               5.0 b
  Ss x Set                    1,104.42      11.58                1.1
  Ss x Sentence               525.43        525.43               49.4

a TA x Set is equivalent to the main effect of talker. b This component 
involves a fixed effect, and its variance component is divided by 2 in 
estimating the percentage of total variance it explains (see Appendix 
A). 

that the differences among sentences are a function 
not of differences in sentence length, but of other factors that 
influence sentence difficulty. Inspection of the statistics for 
individual sentences in Appendix B shows that performance 
varies on both long and short sentences. Among three-word 
sentences, for example, consider I'm sorry (M = 2.74) versus 
Children like candy (M = .21) or Here we go (M = .98). 
Among the longer sentences, there is also considerable 
variability, as in Don't forget to pay your bill before the first of 
the month (M = 3.76) versus Fishing in a mountain stream is 
my idea of a good time (M = .74). Clearly, for these 
sentences, sentence length per se is not a strong determiner 
of speechreading performance. 



Implications of the Variance Component Analysis 
for Test Construction 

In generalizability theory there is a distinction between an 
individual's observed score on a particular test and the score that 
individual would be expected to obtain, on the average, under 
specified conditions of testing. The latter is termed a universe 
score and it is conceptually similar to a true score, as defined in 
classical test theory. The universe score can be determined from 
the interpretation given to the observed score. For example, if a 
subject's score on a particular list of sentences, by a particular 
talker, is interpreted as an estimate of the subject's score across 
all such test lists and talkers, then the universe score is defined 
as the subject's average score across all lists and talkers. If the 
subject's observed score is interpreted as a talker-specific score, 
then the universe score is the subject's average score across 
lists for that talker. 

It can be shown that the variance of observed test scores 
consists of the variance of universe scores plus variance 
attributable to variables that affect the observed score, but 
not the universe score. These extraneous variables are 
considered sources of measurement error. For a given test 
situation, it is possible to estimate a coefficient of generaliz- 
ability that equals the ratio of universe-score variance to 
observed-score variance. A coefficient of generalizability is 
FIGURE 1. Expected value of the coefficient of generalizability as a function of test length (number of sentences) 
for five models of test administration and interpretation. 



directly analogous to a reliability coefficient, as defined in 
classical theory (see Appendix A). The statistical principles 
underlying estimation of generalizability coefficients are 
given in detail by Cronbach et al. (1972) and are summarized 
for the present study in Appendix A. 
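
As a compact restatement of the verbal definition above, the coefficient of generalizability can be written as a variance ratio. The symbols below are illustrative and are not taken from Appendix A: $\sigma^2_{\text{univ}}$ stands for universe-score variance and $\sigma^2_{\text{err}}$ for the variance contributed by sources that affect the observed score but not the universe score.

```latex
\[
  E\rho^2
  = \frac{\sigma^2_{\text{univ}}}{\sigma^2_{\text{obs}}}
  = \frac{\sigma^2_{\text{univ}}}{\sigma^2_{\text{univ}} + \sigma^2_{\text{err}}}
\]
```

Which components count as universe-score variance and which as error depends on the interpretation given to the observed score, so the same data can yield different coefficients under different models.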

In the sections that follow, several models of testing 
speechreading and of interpreting speechreading test scores 
are explored. For each model, estimated variance compo- 
nents from the present study are then used to predict 
generalizability for tests of various lengths. 

Testing with one test and one talker. Assume that a 
speechreading test is constructed of k sentences spoken by 
one talker and that a subject's speechreading performance is 
defined as the average number of words correct on the k 
sentences. The subject's universe score is conceptualized as 
the expected score across all sentences by this talker. 

As Appendix A shows, observed-score variance contains a 
component that is not present in universe-score variance: 
variance due to the interaction of subjects with the particular 
sentences in the test list. As test length increases, however, 
this variance component is reduced by a factor of 1/k. The 
universe-score variance does not change with k, so when it is 
expressed as a proportion of the observed-score variance, 
the proportion increases as k increases. Based on the data 
from this experiment, Figure 1 shows the expected coeffi- 
cients of generalizability for values of fcfrom 5 to 100 (Model 
1). Generalizability increases rapidly with test length and 
reaches .90 by the time k is 40, suggesting that relatively few 
sentences are needed to obtain an acceptably reliable mea- 
sure. For k = 100, generalizability is estimated to be .96. 
Although individual sentences differ markedly from one another
in difficulty, the average score across sentences (for k
> 40) is not highly influenced by the particular sentences 
chosen. 
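
As a rough illustration of how the Model 1 coefficient grows with test length, the sketch below evaluates the ratio of universe-score variance to observed-score variance for several values of k. The function name and the inputs are illustrative assumptions. The subject (10.5) and residual (51.2) values are the percentages of variance reported in the abstract, whereas the Subject x Talker value (1.45) is a hypothetical figure chosen only so that the sketch tracks the coefficients quoted in the text; it is not an estimate taken from the study.

```python
def generalizability_model1(var_subject, var_subject_talker, var_residual, k):
    """Model 1 coefficient: universe-score variance over observed-score variance.

    The residual term pools Subject x Sentence variance with error and
    shrinks by a factor of 1/k as the number of sentences k increases.
    """
    universe = var_subject + var_subject_talker
    observed = universe + var_residual / k
    return universe / observed


# Illustrative inputs (see the lead-in); 1.45 is an assumed value.
VAR_SUBJECT = 10.5
VAR_SUBJECT_TALKER = 1.45  # hypothetical, not reported in the text
VAR_RESIDUAL = 51.2

for k in (10, 40, 50, 100):
    rho2 = generalizability_model1(VAR_SUBJECT, VAR_SUBJECT_TALKER, VAR_RESIDUAL, k)
    print(f"k = {k:3d}: generalizability = {rho2:.2f}")
```

Because the coefficient is a ratio, only the relative sizes of the inputs matter, which is why percentages of variance can stand in for the variance components themselves in this sketch.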

Model 1 represents the practice of standardizing test 
materials with a given talker and restricting test interpretation 
to expected performance across sentences with that talker. 
However, it is also possible to generalize beyond the single 
talker and to conceptualize an individual's universe score as 
the expected score across sentences and talkers. If all 
talkers were equally difficult to speechread, and if there were 
no Subject x Talker Interaction, this interpretation would be 
justified even though testing had taken place with only one 
talker. However, if there are talker effects, this generalization 
is less accurate than for Model 1, because the observed 
score does not include both talkers. The curve for Model 2 in 
Figure 1, in which generalization from one talker to two 
talkers is made, illustrates this point. Although generalizabil- 
ity can be very good (> .80), it cannot be perfect. For the 
present data, generalizability approaches .88 as an asymp- 
tote for Model 2. 

It is important to note that the difference between Models 1 
and 2 lies in the interpretation of the test scores, not in the 
method used to obtain them. For both models, all subjects 
view the same talker. For Model 1, test score interpretation is
based on that particular talker; for Model 2, interpretation is 
generalized across talkers. This emphasis on test score 
interpretation, as well as test design, is an important contri- 
bution from generalizability theory. 

Testing with one test and more than one talker. A third 
approach to testing and interpretation is represented as 
Model 3 in Figure 1. In Model 3, some subjects are tested
with one talker and others are tested with a different talker.
This is the situation that arises when there are different 
recordings of the same materials. Test score interpretation is 
the same as for Model 2: The individual's universe score is 
conceptualized as the expected score across sentences and 
talkers. The generalizability of test scores under this model, 
however, is comparatively low, approaching an estimated 
asymptote of only .62. The poor generalizability of Model 3 
derives from two sources: generalizing across talkers after 
testing a given subject with only one talker, and testing 
different subjects with different talkers. This testing model is 
not uncommon in clinical practice, and it also characterizes 
clinical research studies in which more than one talker is 
used for testing subjects. 

In the next model to be considered (Model 4), each talker 
presents half of the k sentences and generalization is across 
both talkers. (These examples concern only two talkers 
because only two talkers were represented in the present 
study. The same principles could be applied with more than 
two talkers, however.) This model differs from Model 3 in that 
generalization is consistent with the method of testing (i.e., 
both talkers are used) and all subjects are tested in the same 
way. Interestingly, this design produces a generalizability 
function identical to that for Model 2. Although the models 
themselves are distinct, with different observed scores and 
universe scores (see Tables A-3 and A-5), the generalizabil- 
ity coefficients they produce are the same. The reason for 
this result is that variability among subjects is expected to be 
the same under both methods of testing, even if the test 
scores themselves are not (cf. expressions for variance in 
Tables A-4 and A-5). Nevertheless, there is an advantage to 
including more than one talker. The observed score X is a 
biased estimator of the universe score across talkers when 
only a single talker (Model 2) is used (cf. expressions for the 
universe score, U, in Table A-5). Thus if the test score is to be 
generalized across talkers, it is important to represent those 
talkers in the test. 

Testing with more than one test. In each of the models 
presented above, the same test items (i.e., sentences) were 
presented to all subjects. If different subjects receive different 
sets of k items, generalizability is adversely affected (com- 
pared to Model 1). As in Model 1, the subject's universe 
score is conceptualized as the expected score across all 
sentences by a given talker. The curve for Model 5 in Figure 
1 was estimated using the variance components from the 
original statistical model that included lists as an independent 
variable. Although universe-score variance is the same for 
Models 1 and 5, the variance of observed scores is increased 
in Model 5 because subjects do not all view the same list of 
sentences. As a consequence, generalizability is lower than 
for Model 1 and approaches an estimated asymptote of .84. 

Summary. The five models depicted in Figure 1 illustrate 
how the variance components obtained from an experimental 
investigation of sources of variability in test scores can be 
used to develop guidelines for test construction. For the 
present data on CID sentences, obtained from subjects with 
normal hearing, it is estimated that a highly reliable test (> 
.90) can be constructed from 50 sentences, provided all 
subjects receive the same test items and the score is 
interpreted as specific to a particular talker. If assessment of 



individual differences is the primary purpose of testing, it is 
not necessary to include more than a single talker (Model 1), 
but the score must be interpreted as a talker-specific score. 
Testing different subjects with different lists and a single 
talker reduces generalizability, compared to holding list con- 
stant (Model 5 vs. Model 1), but testing with different talkers 
and a given list has even more deleterious effects (Model 3 
vs. Model 1). When the test score is generalized across 
talkers, it is possible to have high generalizability even with a 
single talker (Model 2 & Model 4), provided all subjects 
receive the same test items (cf. Model 3); but to obtain an 
unbiased estimate of an individual's universe score across 
talkers, it is necessary to include more than one talker. 
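
The comparisons summarized above can be sketched numerically. The code below evaluates the coefficients for Models 1-4 using the ratios given in Appendix A (Table A-6). The function name, the dictionary keys, and every variance value are hypothetical and were chosen only to illustrate the ordering of the models; they are not the estimates obtained in this study.

```python
def generalizability(model, comp, k):
    """Coefficient for testing Models 1-4, following the ratios in Table A-6.

    comp maps labels to variance components: 's1' talker assignment,
    's2' subject, 's5' talker, 's6' Talker x Sentence, 's7' Subject x Talker,
    and 'res' for pooled Subject x Sentence plus error.
    """
    within = comp["s2"] + comp["s7"] + comp["res"] / k
    if model == 1:
        # Talker-specific interpretation: Subject x Talker counts as universe score.
        return (comp["s2"] + comp["s7"]) / within
    if model in (2, 4):
        # Generalization across talkers, all subjects tested the same way.
        return comp["s2"] / within
    if model == 3:
        # Different subjects view different talkers, so talker effects
        # add to observed-score variance.
        between_talkers = comp["s1"] + comp["s5"] + comp["s6"] / k
        return comp["s2"] / (within + between_talkers)
    raise ValueError("model must be 1, 2, 3, or 4")


# Hypothetical variance components, chosen only for illustration.
example = {"s1": 0.0, "s2": 10.0, "s5": 5.0, "s6": 5.0, "s7": 1.5, "res": 50.0}

for model in (1, 2, 3, 4):
    values = [round(generalizability(model, example, k), 2) for k in (10, 50, 100)]
    print(f"Model {model}: {values}")
```

With any non-negative inputs of this kind, Model 1 is at least as high as Models 2 and 4, which coincide, and Model 3 is lowest whenever the talker-related components are nonzero.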



Psychometric Evaluation of the CID Sentence 
Lists 

The CID Everyday Sentences, as presented by Davis and 
Silverman (1970), were organized into 10 lists of 10 sen- 
tences each. These lists each contained 50 key words and 
were designed to provide materials for testing auditory 
speech perception. The data from this experiment provide 
useful information regarding these recordings of the CID 
sentences as materials for assessing speechreading. 

Descriptive statistical information for the 10 lists and for the
combination of Lists 1-5 and 6-10 (i.e., Set 1 and Set 2)
was presented in Table 2. Although list, per se, was not a
significant source of variability in the variance component 
analysis, list-to-list variability in the means is evident in Table 
2, as are differences between the two sets. This suggests 
that, on the average, the items that make up the 10 lists and 
the two sets are not equally difficult. The information in 
Appendix B can be used to combine items and/or lists so as 
to achieve sentence sets equated for level of difficulty. 
Although these means are based on the responses of 
individuals with normal hearing, it is not unreasonable to 
expect the relative difficulty of the sentences to be similar 
among hearing-impaired speechreaders. 

In Model 1 it is assumed that all subjects are tested with a 
single set of sentences spoken by one talker, and that the 
universe score is the expected score on all sentences for that 
talker. The values in Figure 1 for Model 1 represent a special
case in which the generalizability coefficient is identical to
coefficient $\alpha$. Coefficient $\alpha$ is a measure of internal consist-
ency reliability, and it can also be interpreted as the expected
correlation of a given test with an alternate test form of the
same length. Based on Figure 1, Model 1, the expected value
of coefficient $\alpha$ for a 10-item list and the expected correlation
between a pair of 10-item lists is .70. Table 6 gives actual
values of coefficient $\alpha$ for each sentence list and for each
talker. The average value of $\alpha$ for these 10-item sentence
lists is .67, a result that accords well with the prediction,
although some of the coefficients are higher than .70, and for
Lists 8 and 9, for the female talker, they are considerably
lower. For the 50-item sentence sets, $\alpha$ averages .90, which
is also in agreement with the theoretically predicted value
from Figure 1 (.92). Coefficient $\alpha$ cannot be calculated for all
100 sentences pooled across talkers because individual
differences among subjects are confounded with talker dif-
ferences.

TABLE 6. Coefficient $\alpha$ and standard error of measurement ($S_e$) for
lists of CID sentences for two talkers.

            Female talker          Male talker
List      $\alpha$     $S_e$      $\alpha$     $S_e$

1           .64        4.4         .63        5.4
2           .75        4.9         .72        5.0
3           .73        5.4         .69        6.1
4           .63        4.1         .71        5.0
5           .70        4.8         .84        4.9
1-5         .92       10.7         .92       11.8
6           .62        3.7         .71        5.6
7           .60        5.0         .84        5.3
8           .47        4.3         .77        5.1
9           .23        3.0         .82        4.3
10          .56        5.1         .75        5.0
6-10        .84        9.7         .94       11.4

Note. Statistics are based on two samples, $n_1 = n_2 = 52$. Subjects
with Talker Assignment 1 viewed the female talker for Lists 1-5 and
the male talker for Lists 6-10; for the remaining subjects the
assignment was reversed.
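
For readers who wish to compute coefficient alpha for their own speechreading data, the following is a minimal sketch using the standard formula, alpha = [k/(k - 1)][1 - (sum of item variances)/(variance of total scores)]. The function name and the toy scores are assumptions made for illustration; the data are invented and are not drawn from this study.

```python
from statistics import variance


def coefficient_alpha(scores):
    """Coefficient alpha for a subjects-by-items table of scores.

    scores: list of rows, one row per subject and one column per item
    (here, words correct on each sentence). Sample variances are used.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two subjects")
    # Variance of each item (column) across subjects.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total score (row sum) across subjects.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up data: 4 subjects x 3 sentences (not from the study).
toy_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
]
print(round(coefficient_alpha(toy_scores), 3))
```

The same function could be applied separately to each list and talker, which is how the entries in a table such as Table 6 are organized.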

The standard errors of measurement based on Model 1
are also given in Table 6. These estimates were obtained
using the formula $S_e = SD\sqrt{1 - \alpha}$, with values of SD ob-
tained from Table 2. Although reliability coefficients, such as
coefficient $\alpha$, are sensitive to the range of individual differ-
ences within a particular population or subpopulation, $S_e$ is
relatively stable, provided measurement error is uncorrelated
with an individual's true (or universe) score. This implies that
the values of $S_e$ in Table 6 would probably generalize to a
population of hearing-impaired speechreaders even if their
overall level of performance differed from that of the normal-
hearing subjects studied here.
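
The standard error of measurement is simple to compute once a standard deviation and a reliability coefficient are available. The sketch below applies the formula SD times the square root of (1 - alpha); the function name and the input values are hypothetical and are used only to show the arithmetic.

```python
import math


def standard_error_of_measurement(sd, alpha):
    """Return SD * sqrt(1 - alpha) for a reliability coefficient in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return sd * math.sqrt(1.0 - alpha)


# Hypothetical values: a list with SD = 8.0 words and alpha = .70.
print(round(standard_error_of_measurement(8.0, 0.70), 1))
```

Note that a more heterogeneous sample would raise both SD and alpha, and these changes tend to offset one another in the product, which is the sense in which the standard error is relatively stable.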



General Discussion 

Three sources of variability in speechreading sentences 
were examined in this study: subjects, talkers, and sen- 
tences. Although sentences were found to be a major source 
of variability, and the interaction of Subject x Sentence (i.e., 
residual error) was even greater, it was shown that both of 
these effects can be controlled if tests are constructed from a 
sufficient number of items. For each of the test models
considered, once k reaches 30 or 40 items, generalizability is quite
near its asymptotic value.

Comparisons among the test models presented reveal that 
generalizability is greater when test interpretation is consis- 
tent with the method of testing. For example, if one talker is 
used, generalizability is higher if the test score is interpreted 
as a talker-specific score than if it is interpreted as an 
estimate of the individual's score across talkers (e.g., Model 
1 vs. Model 2). Another implication is that generalizability is 
greater when all subjects are tested in the same manner. 
Variations in test administration, such as those that arise 
when there are different talkers for different subjects or 



different test lists for different subjects, increase observed- 
score variance and hence lower generalizability. However, 
given that talker effects were greater in the present study 
than list effects, employing different talkers (Model 3) lowers 
generalizability more than using different lists (Model 5). 

Although the results plotted in Figure 1 for Models 2 and 4 
are identical, it is important to understand the difference 
between the two. Equal generalizability implies that individual 
differences, per se, are equally well assessed under both 
models. However, in many test situations, particularly when 
sensory aids to speechreading are being evaluated, it is also 
important to be able to estimate a subject's absolute level of 
performance. It is possible to estimate individual differences 
among subjects using a relatively easy talker or a relatively 
difficult talker. However, if it is important to estimate the 
subject's actual performance level across two or more talk- 
ers, then relying on a score from a single talker can be very 
misleading. Thus the universe of generalization should be 
considered when a strategy for testing is being planned. 

Although the present study provides useful information 
about three sources of variability in speechreading sen- 
tences, the results are specific to the Johns Hopkins video- 
recordings of the CID sentences and, hence, to those two 
talkers. Whether similar results can be obtained with other 
sentence materials for these talkers is currently under inves- 
tigation. However, the generalizability coefficients in Figure 1 
are similar in magnitude to alternate-form reliability coeffi- 
cients reported in the literature. Thus, it is not unreasonable 
to expect that these results will be replicated with other 
sentence materials. As for generalizability to other talkers, it 
is likely that the present results would generalize to other 
talkers who are selected for their presumed visual intelligibil- 
ity and who are instructed to speak clearly and distinctly, yet 
naturally. This prediction will be tested in the future, as 
videorecordings become available for additional talkers. 

An important potential source of variability in speechread- 
ing scores that was not addressed in this study is the test 
occasion. All testing was done in a single session, and thus 
there was no opportunity for day-to-day variations in atten- 
tion, mood, motivation, and so on, to affect the observed test 
scores. This is an important issue for clinical testing and for 
longitudinal studies in which testing takes place over time. 
Likewise, retesting with the same test items was not per- 
formed here, so the results do not address the important 
issue of possible practice effects. 

A final cautionary note concerns the population sampled in this 
study. Although the results demonstrate significant individual 
differences in speechreading among subjects with normal hear- 
ing, it is likely that the variance estimates in Tables 3-5 would be 
different if the sample had also included individuals with hearing 
impairment. With a more heterogeneous population the variance 
attributable to subjects would be larger and, other things being 
equal, all generalizability coefficients would be raised. Thus the 
estimates of generalizability presented here probably represent a 
lower limit on the generalizability to be expected for the general 
population. It is more difficult, however, to predict the results that 
would be obtained from a hearing-impaired population per se. If 
hearing-impaired speechreaders simply had higher scores, on 
the average, than normal-hearing subjects, but with a similar 
range of performance, the present results might be replicated.
However, the possibility exists that those with hearing impair-
ment approach speechreading differently, perhaps because of mo-
tivation, experience, or training, and thus the number and types
of errors they make would be quite different. Therefore it is not
appropriate to assume that the present results would be repli- 
cated in a clinical population of adults with impaired hearing. 

The analyses presented here demonstrate that test construc- 
tion in a particular domain, such as speechreading, can be 
informed by a quantitative analysis of sources of variability in test 
scores. By using this statistical approach with different types of 
stimulus materials, and in combination with a qualitative analysis 
of the nature of the errors that subjects make, it should be 
possible to make considerable progress toward understanding 
multiple aspects of speechreading performance and constructing 
a battery of measures to assess them.

TABLE A-1.

No.  Source of variability     Levels   $\sigma^2$       Expected value of mean square

     Between subjects
1    Talker assignment (TA)    a        $\sigma_1^2$     $bcd\sigma_1^2 + cd\sigma_2^2 + bd\sigma_5^2 + b\sigma_6^2 + d\sigma_7^2 + \sigma_8^2 + \sigma_e^2$
2    Subject within TA         b        $\sigma_2^2$     $cd\sigma_2^2 + d\sigma_7^2 + \sigma_8^2 + \sigma_e^2$
     Within subjects
3    Set                       c        $\sigma_3^2$     $abd\sigma_3^2 + ab\sigma_4^2 + d\sigma_7^2 + \sigma_8^2 + \sigma_e^2$
4    Sentence within Set       d        $\sigma_4^2$     $ab\sigma_4^2 + \sigma_8^2 + \sigma_e^2$
5    TA x Set^a                         $\sigma_5^2$     $bd\sigma_5^2 + b\sigma_6^2 + d\sigma_7^2 + \sigma_8^2 + \sigma_e^2$
6    TA x Sentence                      $\sigma_6^2$     $b\sigma_6^2 + \sigma_8^2 + \sigma_e^2$
7    Subject x Set                      $\sigma_7^2$     $d\sigma_7^2 + \sigma_8^2 + \sigma_e^2$
8    Subject x Sentence                 $\sigma_8^2$     $\sigma_8^2 + \sigma_e^2$

^a TA x Set is equivalent to the main effect of talker.



Appendix A 

Generalizability Analysis 



Generalizability theory (Cronbach et al., 1972) is an extension of 
classical true-score theory. The classical model specifies that the 
test score, X, obtained for a given individual, consists of two 
components, a true score, T, and error of measurement, e. In a
population of individuals, the variance of X is assumed to be the sum
of the variance of T and the variance of e:

$$\sigma_X^2 = \sigma_T^2 + \sigma_e^2. \tag{1}$$

Reliability is defined as the ratio of true-score variance to the
variance of X: $\sigma_T^2/\sigma_X^2$. Reliability is the proportion of X variance that
is due to variance in true scores.

In generalizability theory, the model for X is expanded so as to 
explicitly represent several potential sources of variability. A gener- 
alizability study is an experiment that evaluates the contribution of 
each of these sources to the variance of X. The structural model for 
X is based on the experimental design used. The next section 
presents the structural model for X for the present study and 
indicates how the mean squares from analysis of variance can be 
used to estimate the variance explained by each effect. 

Estimation of Variance Components 

The reduced model for this experiment contains the between- 
subjects effect of talker assignment and the within-subjects effects of 
sentence set and sentence within set. The first two columns of Table 
A-1 identify the eight main effects and interactions that are potential 
sources of variability in X, the subject's score on a particular 
sentence. 

The observed score X is modeled as a grand mean ($\mu$) plus
several effects ($\alpha_i$), plus residual error (e):

$$X = \mu + \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 + \alpha_6 + \alpha_7 + \alpha_8 + e. \tag{2}$$

Each $\alpha$ term represents one of the main effects or interactions in
Table A-1. For example, $\alpha_1$ represents the main effect of talker
assignment and $\alpha_5$ represents the Talker Assignment x Set inter-
action effect (which is equivalent to the main effect of talker). The
number of levels of each independent variable is represented in 
Column 3. For this study, a = 2 (talker assignments), b = 52 
(subjects within each talker assignment), c = 2 (sentence sets), and 
d = 50 (sentences within each set). The variance component 



corresponding to each of these effects, $\sigma_i^2$, is designated with the
same subscript as the effect.

The expected value of each mean square (MS) in the analysis of
variance is represented in the rightmost column of the table. These
expected values were generated using an algorithm presented in
Winer (1971, pp. 371-375), with talker assignment considered a
fixed effect and all other effects considered random. To estimate the
variance component for a particular effect, $\sigma_i^2$, mean squares from
the analysis of variance are added and subtracted so as to produce
an estimator whose expected value is $\sigma_i^2$. For example, for $\sigma_7^2$, the
estimator is $(MS_7 - MS_8)/d$ and for $\sigma_5^2$ it is $(MS_5 - MS_6 - MS_7 + MS_8)/bd$.
The formulas for estimating each component are given in
Table A-2.
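
The estimators in Table A-2 can be checked against the expected mean squares in Table A-1 with a short round-trip sketch. The code below builds expected mean squares from a set of variance components and then recovers the components with the Table A-2 formulas. The function names and the component values are hypothetical; only the design sizes (a = 2, b = 52, c = 2, d = 50) come from the text.

```python
def expected_mean_squares(s, a, b, c, d):
    """Expected mean squares from components s (rightmost column of Table A-1).

    s maps effect numbers 1-8 to variance components; s[8] stands for the
    pooled Subject x Sentence plus residual error variance.
    """
    return {
        1: b * c * d * s[1] + c * d * s[2] + b * d * s[5] + b * s[6] + d * s[7] + s[8],
        2: c * d * s[2] + d * s[7] + s[8],
        3: a * b * d * s[3] + a * b * s[4] + d * s[7] + s[8],
        4: a * b * s[4] + s[8],
        5: b * d * s[5] + b * s[6] + d * s[7] + s[8],
        6: b * s[6] + s[8],
        7: d * s[7] + s[8],
        8: s[8],
    }


def variance_components(ms, a, b, c, d):
    """Estimate variance components from mean squares (Table A-2 estimators)."""
    return {
        1: (ms[1] - ms[2] - ms[5] + ms[7]) / (b * c * d),
        2: (ms[2] - ms[7]) / (c * d),
        3: (ms[3] - ms[4] - ms[7] + ms[8]) / (a * b * d),
        4: (ms[4] - ms[8]) / (a * b),
        5: (ms[5] - ms[6] - ms[7] + ms[8]) / (b * d),
        6: (ms[6] - ms[8]) / b,
        7: (ms[7] - ms[8]) / d,
        8: ms[8],  # Subject x Sentence pooled with residual error
    }


# Hypothetical components, used only to demonstrate the round trip.
true_components = {1: 0.0, 2: 0.4, 3: 0.05, 4: 1.0, 5: 0.2, 6: 0.3, 7: 0.06, 8: 2.0}
ms = expected_mean_squares(true_components, a=2, b=52, c=2, d=50)
recovered = variance_components(ms, a=2, b=52, c=2, d=50)
for effect in sorted(recovered):
    print(effect, round(recovered[effect], 4))
```

With real data the mean squares are sample quantities, so an estimated component can come out slightly negative; how such values are handled is a separate decision that this sketch does not address.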

The total variance of X is a weighted sum of the variance
components (Dwyer, 1974), which incorporates a correction for the
fixed effects in the model. For the present design, the total variance
of X is given by:

$$\sigma_X^2 = (1 - 1/a)\sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \sigma_4^2 + (1 - 1/a)\sigma_5^2 + (1 - 1/a)\sigma_6^2 + \sigma_7^2 + \sigma_8^2 + \sigma_e^2. \tag{3}$$

The weights, (1 - 1/a), appear with each component that involves
talker assignment and equal 1/2 for a = 2. To express the percentage
of total variance explained by a particular effect, the weighted
variance component for that effect is divided by the total variance
and multiplied by 100.
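
The weighting and percentage calculation described above can be sketched as follows. The function name and the component values are hypothetical; the effect numbering follows Table A-1, with the weight (1 - 1/a) applied to the three effects that involve talker assignment.

```python
def percent_of_total_variance(components, a):
    """Percentage of total variance for each effect, using the (1 - 1/a) weights.

    components maps effect numbers 1-8 (Table A-1) to variance components;
    number 8 holds Subject x Sentence pooled with residual error.
    Effects involving talker assignment (1, 5, 6) are weighted by (1 - 1/a).
    """
    weight = 1.0 - 1.0 / a
    weighted = {
        effect: (weight * value if effect in (1, 5, 6) else value)
        for effect, value in components.items()
    }
    total = sum(weighted.values())
    return {effect: 100.0 * value / total for effect, value in weighted.items()}


# Hypothetical components, chosen so the arithmetic is easy to follow.
demo = {1: 0.0, 2: 1.0, 3: 0.0, 4: 2.0, 5: 2.0, 6: 2.0, 7: 0.0, 8: 5.0}
for effect, pct in sorted(percent_of_total_variance(demo, a=2).items()):
    print(effect, round(pct, 1))
```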

Generalizability Coefficients 

In generalizability theory the concept of a true score is replaced
with a universe score, which represents a universe of generalization
used in test-score interpretation. For example, an individual's ex-
pected score across all sentences for a given talker is not necessar-
ily the same as the individual's expected score over both sentences
and talkers. A generalizability coefficient (symbolized $\rho^2$) is the ratio
of universe-score variance to observed-score variance ($\sigma_U^2/\sigma_X^2$) and,
hence, is analogous to a reliability coefficient as defined in the
classical theory.

TABLE A-2.

No.  Source of variability     Estimator

     Between subjects
1    Talker assignment (TA)    $(MS_1 - MS_2 - MS_5 + MS_7)/bcd$
2    Subject within TA         $(MS_2 - MS_7)/cd$
     Within subjects
3    Set                       $(MS_3 - MS_4 - MS_7 + MS_8)/abd$
4    Sentence within Set       $(MS_4 - MS_8)/ab$
5    TA x Set^a                $(MS_5 - MS_6 - MS_7 + MS_8)/bd$
6    TA x Sentence             $(MS_6 - MS_8)/b$
7    Subject x Set             $(MS_7 - MS_8)/d$
8    Subject x Sentence        $MS_8$^b

^a TA x Set is equivalent to the main effect of talker.
^b The interaction of Subject x Sentence cannot be estimated independently of
residual error, so these components are pooled and estimated jointly.

To derive the formula for a generalizability coefficient it is neces-
sary to specify how the test will be used in practice (e.g., the number
of test items, number of talkers, etc.) and how the test scores will be
interpreted (i.e., the universe of generalization). Estimation of a
generalizability coefficient involves several steps:

1. A structural model for X is specified and the composition of
observed-score variance is derived. This model is usually not the 
same as the model used for estimating the variance components. 
Rather, it reflects a design for actual use of the test, and observed 
variance depends on this design. For example, if all individuals are 
tested with a single test form and one talker, observed variance is 
less than if some are tested with one test form and others with 
another. 

2. A universe of generalization is specified and the structural 
model for an individual's universe score is derived. The universe 
score is the individual's expected score. For example, if X is 
interpreted as an estimate of performance across talkers, the uni- 
verse score is the subject's expected score across talkers. 

3. The composition of universe-score variance is derived in terms 
of the structural model. 

4. The generalizability coefficient is formed as the ratio of uni- 
verse-score variance to observed-score variance. 

5. Finally, the estimated variance components from the general-
izability study are substituted in the formula for the generalizability
coefficient.

The generalizability coefficients presented in Figure 1
were generated for five models, each of which represents a different
design for testing and/or score interpretation. Steps 1-4 are pre-
sented for Models 1-4. The equations presented for each model
below are summarized in Tables A-3 to A-6 to facilitate direct
comparisons between models.

Model 1. In this model all subjects are tested with the same set of
k items spoken by one talker, and the test score is the mean number
of words correct across the k items. The observed score X has the
following model:

$$X = \mu + \alpha_1 + \alpha_2 + \alpha_3 + \sum\alpha_4/k + \alpha_5 + \sum\alpha_6/k + \alpha_7 + \sum\alpha_8/k + \sum e/k. \tag{4}$$

This model differs from the model used in the estimation of variance
components because effects that involve sentences are averaged
effects across the k sentences. The effects of talker assignment ($\alpha_1$),
set ($\alpha_3$), sentence ($\alpha_4$), Talker Assignment x Set ($\alpha_5$), and Talker
Assignment x Sentence ($\alpha_6$) are constant across subjects and
hence do not affect the observed variance. The expected observed
variance for the scores in this model is:

$$\sigma_X^2 = \sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k. \tag{5}$$

Because the universe score involves generalization across sen-
tences, a subject's universe score contains no effects for sentences
or their interactions with other factors. A subject's universe score (U)
is therefore:

$$E[X] = U = \mu + \alpha_1 + \alpha_2 + \alpha_3 + \alpha_5 + \alpha_7. \tag{6}$$

The effects of talker assignment ($\alpha_1$), set ($\alpha_3$), and Talker Assign-
ment x Set ($\alpha_5$) are constant, so they do not contribute to universe-
score variance, which is given by:

$$\sigma_U^2 = \sigma_2^2 + \sigma_7^2. \tag{7}$$

Note that this variance includes two components, one for the
individual differences among subjects ($\sigma_2^2$) and one for the interac-
tion of Subject x Set ($\sigma_7^2$). In the present experimental design, the
latter is equivalent to an interaction of Subject x Talker.

The ratio of the universe-score variance to the observed variance
defines the generalizability coefficient:

$$\rho^2 = \frac{\sigma_2^2 + \sigma_7^2}{\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k}. \tag{8}$$

The estimated variance components in Table 4 were substituted in
Equation 8 to obtain estimates of generalizability for the CID sen-
tences. The resulting values are plotted in Figure 1 (Model 1) as a
function of k, the number of items.

Model 2. Suppose the methods of testing are the same as in Model 1,
but that the universe score is conceptualized as the expected score 
across sentences and talkers. The difference between Model 1 and 
Model 2 lies in the interpretation of the test score. In Model 1 it is 



TABLE A-3. Structural model for an observed score under four models of testing.

Testing
model    Structural model for observed score (X)

1        $\mu + \alpha_1 + \alpha_2 + \alpha_3 + \sum\alpha_4/k + \alpha_5 + \sum\alpha_6/k + \alpha_7 + \sum\alpha_8/k + \sum e/k$
2        $\mu + \alpha_1 + \alpha_2 + \alpha_3 + \sum\alpha_4/k + \alpha_5 + \sum\alpha_6/k + \alpha_7 + \sum\alpha_8/k + \sum e/k$
3        $\mu + \alpha_1 + \alpha_2 + \alpha_3 + \sum\alpha_4/k + \alpha_5 + \sum\alpha_6/k + \alpha_7 + \sum\alpha_8/k + \sum e/k$
4        $\mu + \alpha_2 + \alpha_3 + \sum\alpha_4/k + \sum\alpha_6/k + \alpha_7 + \sum\alpha_8/k + \sum e/k$

TABLE A-4. Expected variance of observed scores under four
models of testing.

Testing
model    Expected variance of observed scores

1        $\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k$
2        $\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k$
3        $\sigma_1^2 + \sigma_2^2 + \sigma_5^2 + \sigma_6^2/k + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k$
4        $\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k$



interpreted as a talker-specific score, whereas in Model 2 it is interpreted 
as an estimate of the subject's expected score across talkers. 

Observed variance is the same in Model 1 and Model 2. In Model
2, however, a subject's universe score is:

$$E[X] = U = \mu + \alpha_1 + \alpha_2. \tag{9}$$

Because talker assignment ($\alpha_1$) is constant for all subjects, it does
not contribute to universe-score variance ($\sigma_U^2$), which simply equals
$\sigma_2^2$. The generalizability coefficient for Model 2 is therefore:

$$\rho^2 = \frac{\sigma_2^2}{\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k}. \tag{10}$$

Comparison of the formulas for $\rho^2$ in Model 1 and Model 2 (see Table
A-6) reveals that, unless $\sigma_7^2 = 0$, generalizability is necessarily lower for
Model 2 because scores obtained on only a single talker are generalized
across talkers. Although two of the terms in the denominator of $\rho^2$
approach zero as k approaches infinity, the component for the interaction
of Subject x Set (i.e., Subject x Talker) does not. The magnitude of this



TABLE A-5. Structural model for a universe score and universe-
score variance under four models of testing.

Testing
model    Universe score (U)                                              Universe-score variance

1        $\mu + \alpha_1 + \alpha_2 + \alpha_3 + \alpha_5 + \alpha_7$    $\sigma_2^2 + \sigma_7^2$
2        $\mu + \alpha_1 + \alpha_2$                                     $\sigma_2^2$
3        $\mu + \alpha_2$                                                $\sigma_2^2$
4        $\mu + \alpha_2$                                                $\sigma_2^2$



component therefore places a limit on the generalizability that can be 
attained under Model 2. 
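
The limit on generalizability under Model 2 can be written out explicitly. As a brief worked step in the notation of Table A-6, letting k grow without bound removes the two terms that are divided by k and leaves only the subject and Subject x Talker components:

```latex
\lim_{k \to \infty} \rho^2_{\text{Model 2}}
  = \lim_{k \to \infty}
    \frac{\sigma_2^2}{\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k}
  = \frac{\sigma_2^2}{\sigma_2^2 + \sigma_7^2}
```

For the present data this asymptote was estimated to be approximately .88, as reported in the discussion of Model 2 in the main text.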

Model 3. A different method of data collection is represented by 
Model 3. Here, each subject receives the same k test items, but 
some subjects are tested with Talker 1 and other subjects are tested 
with Talker 2. The observed score of a given subject has the same 
components as in the previous models, but the population mean of 
the observed scores no longer contains effects for talker assignment 
or its interaction with set. These effects contribute to observed 
variance because talker assignment is no longer constant across 
subjects. As a consequence, expected observed variance is:

$$\sigma_X^2 = \sigma_1^2 + \sigma_2^2 + \sigma_5^2 + \sigma_6^2/k + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k. \tag{11}$$

The effects of talker assignment ($\sigma_1^2$), the interaction of talker
assignment and sentence set ($\sigma_5^2$), and the interaction of subject
and sentence set ($\sigma_7^2$) all contribute to observed-score variance. In
this study the last two effects are equivalent to the main effect of
talker and the interaction of Subject x Talker, respectively.

A subject's universe score is the expected value of X across
sentences and talkers, so it contains no effects for these variables:

$$E[X] = U = \mu + \alpha_2. \tag{12}$$

Thus, the variance of universe scores, $\sigma_U^2$, is simply $\sigma_2^2$, and the
generalizability coefficient is given by:

$$\rho^2 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2 + \sigma_5^2 + \sigma_6^2/k + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k}. \tag{13}$$

As with Model 2, there are variance components in the denominator
of this generalizability coefficient that are not affected by the length of
the test (k). Comparison of $\rho^2$ for Models 2 and 3 (see Table A-6)
shows that generalizability for Model 3 is always lower than for Model
2, unless $\sigma_1^2$, $\sigma_5^2$, and $\sigma_6^2$ are all zero.

Model 4. Given that generalization across talkers limits the
generalizability coefficient when testing is done with only a single
talker, the next model represents an alternative strategy: having k/2
items spoken by one talker and the remaining k/2 items spoken by
the other talker. As a result of having each subject view both talkers,
the structural model for X contains no effect of talker assignment and
no interaction of Talker Assignment x Set:

$$X = \mu + \alpha_2 + \alpha_3 + \sum\alpha_4/k + \sum\alpha_6/k + \alpha_7 + \sum\alpha_8/k + \sum e/k. \tag{14}$$



TABLE A-6. Generalizability coefficients for five models of testing.

Testing
model    $\rho^2$

1        $\frac{\sigma_2^2 + \sigma_7^2}{\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k}$

2        $\frac{\sigma_2^2}{\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k}$

3        $\frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2 + \sigma_5^2 + \sigma_6^2/k + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k}$

4        $\frac{\sigma_2^2}{\sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k}$

5        $\frac{\sigma_2^2 + \sigma_7^2}{\sigma_2^2 + \sigma_{\text{List}}^2 + \sigma_4^2/k + \sigma_{\text{TA} \times \text{List}}^2 + \sigma_6^2/k + \sigma_7^2 + \sigma_{\text{Subject} \times \text{List}}^2 + \sigma_8^2/k + \sigma_e^2/k}$

The effects of sentence set ($\alpha_3$), sentence within set ($\alpha_4$), and Talker
x Sentence ($\alpha_6$) are constant for all subjects, so the expected
variance of observed scores is:

$$\sigma_X^2 = \sigma_2^2 + \sigma_7^2 + \sigma_8^2/k + \sigma_e^2/k, \tag{15}$$

which is identical to Equation 5 for Models 1 and 2 (see Table A-4).

The universe score is, again, the expected value of X across 
sentences and talkers, 

$$E[X] = U = \mu + \alpha_2, \tag{16}$$

and the variance of universe scores is therefore $\sigma_2^2$. Formation of the
generalizability coefficient therefore produces the same result as for 
Model 2 (see Table A-6). Although Models 2 and 4 are not identical 
(see Tables A-3 and A-4), from the perspective of assessing 
individual differences, they generate the same generalizability
function.

Model 5. The fifth model plotted in Figure 1 reflects a design 
wherein different subjects may receive different lists of sentences 
(i.e., subjects are nested within lists). Unfortunately, derivation of the 
generalizability coefficient for this design is not as straightforward as
for Models 1-4. This experimental design requires a structural model
that incorporates the main effect of list and the interaction of Talker 
Assignment x List, so the reduced statistical model cannot be used. 
A further complication is that the effect of subject nested within talker 
assignment and list is different from the effect of subject nested 
within talker assignment alone (i.e., $\alpha_2$). The former effect equals the
sum of three effects: subject within talker assignment, the interaction 
of Subject x Set, and the interaction of Subject x List. In addition, 
the expression for a subject's universe score contains the mean 
effect for the subject across lists, whereas the observed score 
contains the nested effect for the subject within a list. Although 
explicit derivation of the expression for the generalizability coefficient 
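is beyond the scope of this paper, it can be shown that $\rho^2$ has the
following form:

$$\rho^2 = \frac{\sigma_2^2 + \sigma_7^2}{\sigma_2^2 + \sigma_{\text{list}}^2 + \sigma_4^2/k + \sigma_{\text{TA} \times \text{List}}^2 + \sigma_6^2/k + \sigma_7^2 + \sigma_{\text{Subject} \times \text{List}}^2 + \sigma_8^2/k + \sigma_e^2/k}. \tag{17}$$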

The function for Model 5 in Figure 1 was calculated using estimated 
variance components obtained from the original statistical model for 
this experiment. 
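A minimal Python sketch of the Model 5 coefficient follows, assuming the form given above with the per-item term for sentence within set taken from the Model 5 row of Table A-6. It separates the components that are unaffected by test length from those divided by k, which makes the ceiling on the coefficient explicit. All numerical values are hypothetical placeholders rather than this study's estimates, and the function and argument names are assumptions.

```python
# Sketch: Model 5 generalizability coefficient as a function of test length k.
# Component values are HYPOTHETICAL placeholders, not this study's estimates.

def rho2_model5(s2, s7, s_list, s_ta_list, s_subj_list, s4, s6, s8, se, k):
    """(s2 + s7) divided by the sum of fixed components and per-item components / k."""
    fixed = s2 + s_list + s_ta_list + s7 + s_subj_list  # unaffected by test length
    per_item = (s4 + s6 + s8 + se) / k                  # shrinks as k increases
    return (s2 + s7) / (fixed + per_item)


def ceiling_model5(s2, s7, s_list, s_ta_list, s_subj_list):
    """Limit of the coefficient as k grows without bound."""
    return (s2 + s7) / (s2 + s_list + s_ta_list + s7 + s_subj_list)


# Hypothetical variance components (each argument stands for a sigma^2 term).
params = dict(s2=0.40, s7=0.02, s_list=0.01, s_ta_list=0.005, s_subj_list=0.01)
items = dict(s4=1.00, s6=0.20, s8=0.10, se=2.00)

for k in (10, 50, 100):
    print(k, round(rho2_model5(**params, **items, k=k), 3))
print("ceiling:", round(ceiling_model5(**params), 3))
```

Under these assumptions, lengthening the test raises the coefficient only toward the ceiling set by the list-related components, not toward 1.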



Appendix B 

Item Statistics for Speechreading Performance (Total Words Correct) on 100 CID Sentences for the Female and Male Talkers 



| Sentence | No. words | Female M | Female SD | Male M | Male SD | Both M | Both SD |
|---|---|---|---|---|---|---|---|
| 1. Walking's my favorite exercise. | 5 | 1.08 | .97 | .67 | 1.06 | .88 | 1.03 |
| 2. Here's a nice quiet place to rest. | 8 | .25 | .48 | .23 | .51 | .24 | .49 |
| 3. Our janitor sweeps the floors every night. | 7 | .19 | .53 | .25 | .52 | .22 | .52 |
| 4. It would be much easier if everyone would help. | 9 | 1.75 | 2.42 | 2.38 | 2.60 | 2.07 | 2.51 |
| 5. Good morning. | 2 | .12 | .47 | 1.37 | .86 | .74 | .93 |
| 6. Open your window before you go to bed! | 8 | 1.88 | 1.86 | 2.37 | 1.98 | 2.12 | 1.93 |
| 7. Do you think that she should stay out so late? | 10 | 3.33 | 1.68 | 4.40 | 2.58 | 3.87 | 2.23 |
| 8. How do you feel about changing the time when we begin work? | 12 | .90 | 1.27 | 4.21 | 2.93 | 2.56 | 2.80 |
| 9. Here we go. | 3 | 1.08 | 1.31 | .88 | 1.13 | .98 | 1.22 |
| 10. Move out of the way! | 5 | 3.19 | 2.29 | 3.67 | 2.14 | 3.43 | 2.22 |
| 11. The water's too cold for swimming. | 7 | 1.12 | 1.72 | 1.04 | 1.99 | 1.08 | 1.85 |
| 12. Why should I get up so early in the morning? | 10 | 3.29 | 2.53 | 3.25 | 2.74 | 3.27 | 2.62 |
| 13. Here are your shoes. | 4 | .50 | .92 | .37 | .69 | .43 | .81 |
| 14. It's raining. | 3 | 1.02 | 1.21 | .92 | .79 | .97 | 1.02 |
| 15. Where are you going? | 4 | 2.44 | 1.69 | 2.98 | 1.23 | 2.71 | 1.49 |
| 16. Come here when I call you! | 6 | 1.54 | 1.54 | 1.29 | .80 | 1.41 | 1.23 |
| 17. Don't try to get out of it this time! | 10 | 2.04 | 2.25 | 2.44 | 2.52 | 2.24 | 2.39 |
| 18. Should we let little children go to the movies by themselves? | 11 | 1.85 | 2.52 | 1.81 | 2.51 | 1.82 | 2.50 |
| 19. There isn't enough paint to finish the room. | 9 | .88 | .88 | 1.21 | 1.61 | 1.05 | 1.30 |
| 20. Do you want an egg for breakfast? | 7 | 1.69 | 1.53 | .79 | 1.14 | 1.24 | 1.42 |
| 21. Everybody should brush his teeth after meals. | 7 | .33 | .98 | .21 | .82 | .27 | .91 |
| 22. Everything's all right. | 4 | 1.58 | 1.71 | 1.69 | 1.70 | 1.63 | 1.70 |
| 23. Don't use up all the paper when you write your letter. | 12 | 1.19 | 1.69 | 3.10 | 2.81 | 2.14 | 2.50 |
| 24. That's right. | 3 | 1.19 | 1.21 | 1.42 | 1.45 | 1.31 | 1.33 |
| 25. People ought to see a doctor once a year. | 9 | .69 | 1.39 | .79 | 1.04 | .74 | 1.22 |
| 26. Those windows are so dirty I can't see anything outside. | 11 | 1.71 | 2.38 | 1.85 | 2.29 | 1.78 | 2.32 |
| 27. Pass the bread and butter please! | 6 | .79 | 1.18 | 1.87 | 2.10 | 1.33 | 1.78 |
| 28. Don't forget to pay your bill before the first of the month. | 13 | 2.17 | 3.02 | 5.35 | 3.41 | 3.76 | 3.58 |

| Sentence | No. words | Female M | Female SD | Male M | Male SD | Both M | Both SD |
|---|---|---|---|---|---|---|---|
| 29. Don't let the dog out of the house! | 9 | 1.98 | 2.59 | 2.81 | 2.77 | 2.39 | 2.70 |
| 30. There's a good ballgame this afternoon. | 7 | 1.58 | 2.01 | 1.42 | 1.14 | 1.50 | 1.63 |
| 31. It's time to go. | 5 | 1.19 | 1.68 | 2.35 | 2.26 | 1.77 | 2.06 |
| 32. If you don't want these old magazines, throw them out. | 11 | 2.50 | 1.77 | 1.85 | 2.28 | 2.17 | 2.05 |
| 33. Do you want to wash up? | 6 | 2.21 | 2.02 | 2.58 | 1.79 | 2.39 | 1.91 |
| 34. It's a real dark night, so watch your driving. | 10 | .67 | .83 | .90 | 1.27 | .79 | 1.08 |
| 35. I'll carry the package for you. | 7 | 2.15 | 1.49 | 2.83 | 1.58 | 2.49 | 1.56 |
| 36. Did you forget to shut off the water? | 8 | 1.82 | 2.06 | 1.88 | 1.55 | 1.86 | 1.82 |
| 37. Fishing in a mountain stream is my idea of a good time. | 12 | .21 | .57 | 1.27 | 1.39 | .74 | 1.18 |
| 38. Fathers spend more time with their children than they used to. | 11 | .40 | .87 | 1.21 | 2.24 | .81 | 1.74 |
| 39. Be careful not to break your glasses! | 7 | .35 | .59 | 1.42 | 1.92 | .88 | 1.52 |
| 40. I'm sorry. | 3 | 2.65 | .93 | 2.83 | .71 | 2.74 | .82 |
| 41. You can catch the bus across the street. | 8 | 1.21 | 1.29 | 1.37 | 1.52 | 1.29 | 1.41 |
| 42. Call her on the phone and tell her the news. | 10 | .42 | .78 | 1.06 | 1.60 | .74 | 1.29 |
| 43. I'll catch up with you later. | 7 | 1.21 | 2.06 | 1.27 | 1.94 | 1.24 | 1.99 |
| 44. I'll think it over. | 5 | .71 | 1.26 | 3.21 | 1.61 | 1.96 | 1.91 |
| 45. I don't want to go to the movies tonight. | 10 | [illegible] | 3.27 | 7.15 | 2.93 | 5.16 | 3.68 |
| 46. If your tooth hurts that much you ought to see a dentist. | 12 | 1.62 | .84 | 1.71 | 1.02 | 1.66 | .93 |
| 47. Put that cookie back in the box! | 7 | 1.52 | 1.96 | 1.00 | 1.36 | 1.26 | 1.70 |
| 48. Stop fooling around! | [illegible] | 1.19 | 1.14 | 2.42 | 1.09 | 1.81 | 1.37 |
| 49. Time's up. | [illegible] | 1.12 | 1.06 | 1.23 | 1.37 | 1.17 | 1.22 |
| 50. How do you spell your name? | [illegible] | 1.02 | 1.59 | 4.15 | 2.04 | 2.59 | 2.41 |
| 51. Music always cheers me up. | [illegible] | 1.15 | 1.30 | 1.37 | 1.20 | 1.26 | 1.25 |
| 52. My brother's in town for a short while on business. | 11 | 1.17 | 1.26 | 1.04 | 1.45 | 1.11 | 1.36 |
| 53. We live a few miles from the main road. | 9 | [illegible] | 1.45 | .69 | 1.87 | .72 | 1.67 |
| 54. This suit needs to go to the cleaners. | [illegible] | [illegible] | [illegible] | .21 | .41 | .21 | .43 |
| 55. They ate enough green apples to make them sick for a week. | [illegible] | [illegible] | 1.72 | 1.00 | 1.25 | 1.50 | 1.58 |
| 56. Where have you been all this time? | 7 | 1.11 | 1.75 | 3.12 | 2.99 | 2.12 | 2.64 |
| 57. Have you been working hard lately? | 6 | .71 | .98 | 2.67 | 2.28 | 1.69 | 2.01 |
| 58. There's not enough room in the kitchen for a new table. | [illegible] | [illegible] | [illegible] | 2.69 | 3.67 | 1.76 | 2.90 |
| 59. Where is he? | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | .97 |
| 60. Look out! | [illegible] | [illegible] | .90 | [illegible] | [illegible] | .71 | .87 |
| 61. I'll see you right after lunch. | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | 2.10 |
| 62. See you later. | 3 | .81 | 1.28 | .67 | .92 | .74 | 1.11 |
| 63. White shoes are awful to keep clean. | 7 | .17 | .43 | .50 | .64 | .34 | .57 |
| 64. Stand there and don't move until I tell you! | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | 2.77 |
| 65. There's a big piece of cake left over from dinner. | 11 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] |
| 66. Wait for me at the corner in front of the drugstore. | 11 | 1.17 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] |
| 67. It's no trouble at all. | 6 | .83 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] |
| 68. Hurry up! | 2 | .81 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] |
| 69. The morning paper didn't say anything about rain this afternoon or tonight. | 13 | 1.56 | 2.15 | 2.02 | 2.77 | 1.79 | 2.48 |
| 70. The phone call's for you. | 6 | 2.13 | 1.55 | 2.52 | 1.48 | 2.33 | 1.52 |
| 71. Believe me! | 2 | .87 | .86 | 1.35 | .68 | 1.11 | .81 |
| 72. Let's get a cup of coffee. | 7 | .38 | .72 | 1.37 | 1.76 | .88 | 1.43 |
| 73. Let's get out of here before it's too late. | 11 | 2.12 | 2.98 | 2.46 | 3.75 | 2.29 | 3.38 |
| 74. I hate driving at night. | 5 | .29 | .57 | 1.60 | 1.36 | .94 | 1.23 |
| 75. There was water in the cellar after that heavy rain yesterday. | 11 | .79 | .98 | .62 | .97 | .70 | .97 |
| 76. She'll only be gone a few minutes. | 8 | .92 | 1.91 | 1.31 | 2.12 | 1.12 | 2.02 |
| 77. How do you know? | 4 | 1.81 | 1.53 | 2.79 | 1.36 | 2.30 | 1.53 |
| 78. Children like candy. | 3 | .13 | .34 | .29 | .61 | .21 | .50 |
| 79. If we don't get rain soon, we'll have no grass. | 12 | .60 | .93 | 1.60 | 2.19 | 1.10 | 1.75 |
| 80. They're not listed in the new phone book. | 9 | .71 | 1.39 | 1.12 | 1.80 | .91 | 1.61 |
| 81. Where can I find a place to park? | 8 | .50 | 1.21 | .79 | 1.40 | .64 | 1.31 |
| 82. I like those big red apples we always get in the fall. | 12 | .77 | .88 | 2.12 | 2.23 | 1.44 | 1.82 |

| Sentence | No. words | Female M | Female SD | Male M | Male SD | Both M | Both SD |
|---|---|---|---|---|---|---|---|
| 83. You'll get fat eating candy. | 6 | .23 | .55 | .48 | .58 | .36 | .57 |
| 84. The show's over. | 4 | .46 | .90 | .44 | .80 | .45 | .85 |
| 85. Why don't they paint their walls some other color? | 10 | .48 | 1.15 | .90 | 1.43 | .69 | 1.31 |
| 86. What's new? | 3 | .13 | .60 | 1.35 | 1.30 | .74 | 1.17 |
| 87. What are you hiding under your coat? | 7 | [illegible] | [illegible] | [illegible] | [illegible] | 1.03 | 1.04 |
| 88. How come I should always be the one to go first? | 11 | 1.10 | 1.55 | 2.58 | 2.44 | 1.84 | 2.16 |
| 89. I'll take sugar and cream in my coffee. | 9 | .37 | .79 | 1.67 | 2.09 | 1.02 | 1.71 |
| 90. Wait just a minute! | 4 | .15 | .70 | .35 | 1.10 | .25 | .92 |
| 91. Breakfast is ready. | 3 | .25 | .68 | 1.23 | 1.32 | .74 | 1.16 |
| 92. I don't know what's wrong with the car, but it won't start. | 15 | 3.85 | 3.24 | 7.50 | 3.28 | 5.67 | 3.73 |
| 93. It sure takes a sharp knife to cut this meat. | 10 | .40 | .50 | .54 | .75 | .47 | .64 |
| 94. I haven't read a newspaper since we bought a television set. | 12 | .65 | 1.30 | 1.81 | 2.35 | 1.23 | 1.98 |
| 95. Weeds are spoiling the yard. | 5 | .25 | .44 | .54 | .54 | .39 | .51 |
| 96. Call me a little later! | 5 | .12 | .43 | .96 | 1.40 | .54 | 1.11 |
| 97. Do you have change for a five-dollar bill? | 9 | 2.90 | 3.12 | 7.83 | 2.36 | 5.37 | 3.70 |
| 98. How are you? | 3 | 2.60 | .93 | 2.77 | .70 | 2.68 | .83 |
| 99. I'd like some ice cream with my pie. | 9 | .48 | .83 | 1.63 | 1.50 | 1.06 | 1.34 |
| 100. I don't think I'll have any dessert. | 9 | 4.48 | 2.15 | 6.02 | 1.79 | 5.25 | 2.11 |
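
Because the sentences differ in length, the raw item means in Appendix B are not directly comparable across sentences. One simple way to put them on a common scale, offered here only as an illustrative sketch and not as an analysis reported in this paper, is to divide each mean by the number of words in the sentence. The rows below are copied from Appendix B (both-talker means); the function name is an assumption.

```python
# Sketch: proportion of words correct per sentence, from Appendix B ("Both" talkers).
# The normalization is an illustrative assumption, not a statistic reported in the paper.

# (sentence number, number of words, mean words correct for both talkers)
items = [
    (1, 5, 0.88),    # Walking's my favorite exercise.
    (7, 10, 3.87),   # Do you think that she should stay out so late?
    (92, 15, 5.67),  # I don't know what's wrong with the car, but it won't start.
    (98, 3, 2.68),   # How are you?
]


def proportion_correct(n_words, mean_correct):
    """Mean words correct divided by sentence length."""
    return mean_correct / n_words


for number, n_words, mean_correct in items:
    print(number, round(proportion_correct(n_words, mean_correct), 2))
```

On this scale a short, highly familiar sentence such as "How are you?" ranks above longer sentences that have larger raw means.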



